Geospatial analysis provides a distinct perspective on the world, a unique lens through which to examine events, patterns, and processes that operate on or near the surface of our planet. Ultimately geospatial analysis concerns what happens where, and makes use of geographic information that links features and phenomena on the Earth’s surface to their locations.
We can talk about a few different concepts when it comes to spatial information. These are:
At the center of all spatial analysis is the concept of place. People identify with places of various sizes and shapes, from the room to the parcel of land, to the neighbourhood, to the city, the country, the state or the nation state. Places often have names, and people use these to talk about and distinguish places. Names can be official. Places also change continually as people move. The basis of rigorous and precise definition of place is a coordinate system, a set of measurements that allows place to be specified unambiguously and in a way that is meaningful to everyone.
Attribute has become the preferred term for any recorded characteristic or property of a place. A place’s name is an obvious example of an attribute. But there can be other pieces of information, such as the number of crimes in a neighbourhood, or the GDP of a country. Within GIS the term ‘attributes’ usually refers to records in a data table associated with individual features in a vector map or cells in a grid (raster or image file). These data behave exactly like the data you have encountered in your data analysis courses. The rows represent observations, and the columns represent variables. The variables can be numeric or categorical, and depending on what they are, you can apply different methods to make sense of them.
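To make the idea of an attribute table concrete, here is a minimal sketch in base R. The three areas and all of their values are invented for illustration; they are not real data.

```r
# A toy attribute table: one row per place, one column per attribute.
# All names and numbers here are invented for illustration.
places <- data.frame(
  name = c("Area A", "Area B", "Area C"),
  crimes = c(120, 95, 143),                              # numeric attribute
  area_type = factor(c("urban", "urban", "suburban")),  # categorical attribute
  stringsAsFactors = FALSE
)

# Numeric attributes can be summarised with means, medians and so on...
mean(places$crimes)

# ...while categorical attributes are summarised with frequency counts.
table(places$area_type)
```

Which summary makes sense depends on the type of the variable, which is exactly the point made above about numeric and categorical attributes.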
In spatial analysis it is customary to refer to places as objects. These objects can be a whole country, or a road. In studies of climate change, the objects of interest might be weather stations of minimal extent, and will be represented as points. On the other hand, studies of social or economic patterns may need to consider the two-dimensional extent of places, which will therefore be represented as areas. These representations of the world are part of what is called the vector data model: a representation of the world using points, lines, and polygons. Vector models are useful for storing data that have discrete boundaries, such as country borders, land parcels, and streets. The vector model is made up of points, lines, and areas (polygons):
Objects can also be Raster data. Raster data is made up of pixels (or cells), and each pixel has an associated value. Simplifying slightly, a digital photograph is an example of a raster dataset where each pixel value corresponds to a particular colour. In GIS, the pixel values may represent elevation above sea level, or chemical concentrations, or rainfall etc. The key point is that all of this data is represented as a grid of (usually square) cells.
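As a rough sketch of the difference between the two data models, the base R code below stores a point, a line and a polygon as coordinates, and a raster as a matrix of cell values. All coordinates and elevations are made up, and real spatial packages use richer structures than these plain vectors and matrices.

```r
# Vector model: geometry stored as coordinates (toy values in arbitrary units).
station <- c(x = 2, y = 3)                   # a point: one coordinate pair
road <- rbind(c(0, 0), c(1, 2), c(3, 3))     # a line: an ordered set of vertices
parcel <- rbind(c(0, 0), c(4, 0), c(4, 3),   # a polygon: a ring of vertices
                c(0, 3), c(0, 0))            # that ends where it starts

# Raster model: a grid of cells, each holding one value (made-up elevations).
elevation <- matrix(c(10, 12, 15,
                      11, 14, 18,
                      13, 17, 21),
                    nrow = 3, byrow = TRUE)

# Look up the value stored in the cell on row 2, column 3.
elevation[2, 3]
```

Notice that the vector objects record where the boundaries are, while the raster records a value for every cell whether or not anything interesting happens there.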
Historically maps have been the primary means to store and communicate spatial data. Objects and their attributes can be readily depicted, and the human eye can quickly discern patterns and anomalies in a well-designed map.
Map projections try to portray the surface of the earth or a portion of the earth on a flat piece of paper or computer screen. A coordinate reference system (CRS) then defines, with the help of coordinates, how the two-dimensional, projected map in your GIS is related to real places on the earth. The decision as to which map projection and coordinate reference system to use, depends on the regional extent of the area you want to work in, on the analysis you want to do and often on the availability of data.
A traditional method of representing the earth’s shape is the use of globes. When viewed at close range the earth appears to be relatively flat. However, when viewed from space, we can see that the earth is relatively spherical. Maps are representations of reality. They are designed to represent not only features, but also their shape and spatial arrangement. Each map projection has advantages and disadvantages. The best projection for a map depends on the scale of the map, and on the purposes for which it will be used. For your purposes, you just need to understand that essentially there are different ways to flatten out the earth, in order to get it into a 2-dimensional map.
The process of creating map projections can be visualised by positioning a light source inside a transparent globe on which opaque earth features are placed, and then projecting the feature outlines onto a two-dimensional flat piece of paper. Different ways of projecting can be produced by surrounding the globe in a cylindrical fashion, as a cone, or even as a flat surface. Each of these methods produces what is called a map projection family. Therefore, there is a family of planar projections, a family of cylindrical projections, and another called conical projections (see figure_projection_families).
figure_projection_families
With the help of coordinate reference systems (CRS) every place on the earth can be specified by a set of three numbers, called coordinates. In general CRS can be divided into projected coordinate reference systems (also called Cartesian or rectangular coordinate reference systems) and geographic coordinate reference systems.
The use of Geographic Coordinate Reference Systems is very common. They use degrees of latitude and longitude and sometimes also a height value to describe a location on the earth’s surface. The most popular is called WGS 84. This is the one you will most likely be using, and if you get your data in latitude and longitude, then this is the CRS you are working in. It is also possible that you will be using a projected CRS. This two-dimensional coordinate reference system is commonly defined by two axes. At right angles to each other, they form a so-called XY-plane. The horizontal axis is normally labelled X, and the vertical axis is normally labelled Y.
Working with data in the UK, on the other hand, you are most likely to be using British National Grid (BNG). The Ordnance Survey National Grid reference system is a system of geographic grid references used in Great Britain, different from using latitude and longitude. In this case, points will be defined by “Easting” and “Northing” rather than “Longitude” and “Latitude”. It basically divides the UK into a series of squares, and uses references to these to locate something. The most common usage is the six-figure grid reference, employing three digits in each coordinate to determine a 100 m square. For example, the grid reference of the 100 m square containing the summit of Ben Nevis is NN 166 712. Grid references may also be quoted as a pair of numbers: eastings then northings in metres, measured from the southwest corner of the SV square. For example, the grid reference for Sullom Voe oil terminal in the Shetland Islands may be given as HU396753 or 439668,1175316.
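To see how the lettered and the all-numeric forms relate, here is a minimal sketch in base R. The function grid_ref_to_en() is a hypothetical helper written for this illustration (it does not come from any package). It assumes the standard lettering of the grid, in which the letters A to Z (skipping I) are laid out in 5 by 5 blocks, and it returns the southwest corner of the square that the reference identifies.

```r
# Hypothetical helper (not from any package): convert an Ordnance Survey grid
# reference such as "NN 166 712" into eastings and northings in metres.
grid_ref_to_en <- function(ref) {
  ref <- toupper(gsub(" ", "", ref))
  alphabet <- setdiff(LETTERS, "I")              # the grid does not use the letter I
  l1 <- match(substr(ref, 1, 1), alphabet) - 1   # first letter: 500 km square
  l2 <- match(substr(ref, 2, 2), alphabet) - 1   # second letter: 100 km square
  e100km <- ((l1 - 2) %% 5) * 5 + (l2 %% 5)
  n100km <- (19 - (l1 %/% 5) * 5) - (l2 %/% 5)

  digits <- substr(ref, 3, nchar(ref))
  stopifnot(nchar(digits) %% 2 == 0)             # same number of digits for each axis
  half <- nchar(digits) / 2
  scale <- 10^(5 - half)                         # 3 digits per axis means 100 m units

  c(easting  = e100km * 100000 + as.numeric(substr(digits, 1, half)) * scale,
    northing = n100km * 100000 + as.numeric(substr(digits, half + 1, 2 * half)) * scale)
}

grid_ref_to_en("NN 166 712")   # easting 216600, northing 771200
grid_ref_to_en("HU396753")     # easting 439600, northing 1175300
```

The second result is the corner of the 100 m square that contains the full-precision coordinates 439668,1175316 quoted above, which is why the last two digits of each number differ.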
BNG
This will be important later on when we are linking data from different projections, or when you look at your map and you try to figure out why it might look “squished”.
We already mentioned lines that constitute objects of spatial data, such as streets, roads, railroads, etc. Networks constitute one-dimensional structures embedded in two or three dimensions. Discrete point objects may be distributed on the network, representing phenomena such as landmarks, or observation points. Mathematically, a network forms a graph, and many techniques developed for graphs have application to networks. These include various ways of measuring a network’s connectivity, or of finding the shortest path between pairs of points on a network. You can have a look at the lesson on network analysis in the QGIS documentation
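As a small illustration of treating a network as a graph, the base R sketch below finds the shortest network distance from one junction to all others using Dijkstra's algorithm. The four junctions and the link lengths are invented, and shortest_distances() is a hypothetical helper written for this example; for real analyses you would more likely use a dedicated tool such as the QGIS network analysis features mentioned above.

```r
# Toy road network: a symmetric matrix of link lengths in metres
# (Inf means there is no direct link between two junctions).
nodes <- c("A", "B", "C", "D")
w <- matrix(Inf, nrow = 4, ncol = 4, dimnames = list(nodes, nodes))
diag(w) <- 0
w["A", "B"] <- w["B", "A"] <- 100
w["B", "C"] <- w["C", "B"] <- 150
w["A", "C"] <- w["C", "A"] <- 400
w["C", "D"] <- w["D", "C"] <- 50

# Dijkstra's algorithm: shortest distance along the network from `source`
# to every junction.
shortest_distances <- function(w, source) {
  best <- setNames(rep(Inf, nrow(w)), rownames(w))
  best[source] <- 0
  unvisited <- rownames(w)
  while (length(unvisited) > 0) {
    # Visit the closest junction that has not been processed yet.
    current <- unvisited[which.min(best[unvisited])]
    unvisited <- setdiff(unvisited, current)
    # See whether going through `current` shortens the route to its neighbours.
    for (neighbour in unvisited) {
      candidate <- best[[current]] + w[current, neighbour]
      if (candidate < best[[neighbour]]) best[neighbour] <- candidate
    }
  }
  best
}

shortest_distances(w, "A")

# A simple connectivity measure: the number of direct links at each junction.
rowSums(is.finite(w) & w > 0)
```

In this toy network the direct link from A to C is 400 m, but the route through B is only 250 m, so the shortest path is not always the most direct-looking one.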
One of the more useful concepts in spatial analysis is density - the density of humans in a crowded city, or the density of retail stores in a shopping centre. Mathematically, the density of some kind of object is calculated by counting the number of such objects in an area, and dividing by the size of the area. To read more about this, I recommend Silverman, Bernard W. Density estimation for statistics and data analysis. Vol. 26. CRC press, 1986.
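The calculation described above is simple enough to sketch in a couple of lines of base R. The store counts and areas below are invented for illustration.

```r
# Made-up counts of retail stores and areas (in square kilometres) for three zones.
stores <- c(centre = 42, north = 10, south = 6)
area_km2 <- c(centre = 1.5, north = 4, south = 12)

# Density = number of objects / size of the area.
density_per_km2 <- stores / area_km2
density_per_km2
```

Dividing by area is what makes zones of different sizes comparable: the centre has the most stores here, but it is the density (28 stores per square kilometre) that shows how concentrated they are.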
Right, so hopefully this gives you a few things to think about. Make sure you are confident that you understand:
Maps of the kind we will cover in this course are simply a form of data visualisation. In previous courses you may have learnt about histograms, scatterplots, and other ways of representing data in a two-dimensional space. R is pretty good at producing data visualisations, and there are three main approaches to doing so, each rooted in particular packages. The oldest is what people refer to as base R: the basic R installation has loads of plotting capabilities and follows a very particular philosophy about how to produce graphs. More modern packages are lattice, for multivariate data visualisation, and ggplot2, which relies on a theoretical model called the grammar of graphics.
In the same way, there are many different packages that can be used to produce maps, some of which rely on the functionality provided by base R and others that rely on ggplot2 or other external graphical packages. In this course we will play around with several of these R packages to produce maps. Many offer similar functionality, but they all have particular advantages (and disadvantages). So, in practice, you may shift among them depending on what you want to achieve. We will start by looking at ggmap, which, as the name hints, relies on the broader ggplot2 package. It makes it easy to produce maps with contextual information from static map services such as Google Maps, OpenStreetMap, or Stamen Maps.
We are going to start now with some code. So, it would be a good idea for you to make sure you have your RStudio project open and ready to go. As usual, first you will need to install this package on your machine, something you should know how to do by now. Then load the package using the code below.
library(ggmap)
## Loading required package: ggplot2
In ggmap, downloading a map as an image and formatting the image for plotting is done with the get_map() function. As the most important characteristic of any map is location, the most important argument of get_map() is the location argument. Ideally, location is a longitude/latitude pair specifying the center of the map and accompanied by a zoom argument, an integer from 3 to 20 specifying how large the spatial extent should be around the center, with 3 being the continent level and 20 being roughly the single building level. We can also use an additional argument, scale, to reduce the quality (from 2 to 1) of the map we are downloading (this will speed up the download).
#First we create a numeric object with the longitude and latitude for the University of Manchester
UoM <- c(lon= -2.233885, lat= 53.466852)
#Then we will get a map of the Uni, setting the zoom at 15 and the quality of the map at 1
map1 <- get_map(UoM, zoom = 15, scale= 1)
## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=53.466852,-2.233885&zoom=15&size=640x640&scale=1&maptype=terrain&language=en-EN&sensor=false
You will notice that a large object has been created in your global environment. This object has the basemap that we want to plot. Let’s do that. To visualise the map we can use the ggmap() function.
ggmap(map1)
This is what in the lecture today we described as a reference map. We often may want to use these reference maps as basemaps for our thematic maps. They provide context and help with the interpretation. But what we want to learn about in this course is thematic maps, maps that tell stories, and for that we need data. We will look at cool data in the next section, but first just a couple of things about these basemaps. We just looked at one option but there are others.
For example, you could look at the satellite pictures rather than the street map. To do that, you simply change some of the arguments in the get_map() function.
map1_sat <- get_map(UoM, zoom = 15, scale= 1, maptype = "satellite")
## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=53.466852,-2.233885&zoom=15&size=640x640&scale=1&maptype=satellite&language=en-EN&sensor=false
ggmap(map1_sat)
Or you could use a basemap from providers other than Google, for example Stamen maps. In this case, you pass two new arguments to get_map(): a source argument that identifies where to get your basemap from and a maptype argument that specifies what particular Stamen map you want (you can look at their flavors as described on their website). Here we are going to use the “toner” variant, which I like.
map1_stamen <- get_map(UoM, zoom = 15, scale= 1, source = "stamen", maptype = "toner")
## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=53.466852,-2.233885&zoom=15&size=640x640&scale=2&maptype=terrain&sensor=false
## Map from URL : http://tile.stamen.com/toner/15/16179/10601.png
## Map from URL : http://tile.stamen.com/toner/15/16180/10601.png
## Map from URL : http://tile.stamen.com/toner/15/16181/10601.png
## Map from URL : http://tile.stamen.com/toner/15/16179/10602.png
## Map from URL : http://tile.stamen.com/toner/15/16180/10602.png
## Map from URL : http://tile.stamen.com/toner/15/16181/10602.png
## Map from URL : http://tile.stamen.com/toner/15/16179/10603.png
## Map from URL : http://tile.stamen.com/toner/15/16180/10603.png
## Map from URL : http://tile.stamen.com/toner/15/16181/10603.png
## Map from URL : http://tile.stamen.com/toner/15/16179/10604.png
## Map from URL : http://tile.stamen.com/toner/15/16180/10604.png
## Map from URL : http://tile.stamen.com/toner/15/16181/10604.png
ggmap(map1_stamen)
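The tile addresses printed in the output above follow the widely used zoom/x/y tiling scheme, in which the world is cut into 2^zoom by 2^zoom square tiles. Assuming that scheme, the sketch below works out which tile contains the University. The function lonlat_to_tile() is a hypothetical helper written for this illustration, not part of ggmap.

```r
# Hypothetical helper: which zoom/x/y tile contains a longitude/latitude pair,
# under the standard Web Mercator tiling scheme.
lonlat_to_tile <- function(lon, lat, zoom) {
  n <- 2^zoom                     # number of tiles along each axis at this zoom
  lat_rad <- lat * pi / 180
  x <- floor((lon + 180) / 360 * n)
  y <- floor((1 - log(tan(lat_rad) + 1 / cos(lat_rad)) / pi) / 2 * n)
  c(x = x, y = y)
}

# The University of Manchester at the zoom level we used above.
lonlat_to_tile(lon = -2.233885, lat = 53.466852, zoom = 15)
```

The x value comes out as 16180, the middle column of the tiles listed in the output, and the y value falls within the rows 10601 to 10604 that were downloaded. Each extra zoom level doubles the number of tiles along each axis, which is why a higher zoom shows a smaller area in more detail.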
We can play around with police recorded crime data, which can be downloaded from the police.uk website.
Let’s download some data for crime in Manchester.
To do this, open the data.police.uk/data website.
Under Date range, select just one month of data. Choose whatever month you like. I will choose November 2017, so if you want to see the same results as shown here, pick that month.
Under Force, find Greater Manchester Police, and tick the box next to it.
Under Data sets, tick Include crime data.
Click the Generate File button.
This will take you to a download page, where you have to click the Download now button. This will open a dialogue to save a .zip file. Navigate to the project directory folder you’ve created and save it there. Unzip the file.
#You can use the unzip() function for this. A cool function you may want to use as well is file.choose(). If we pass this function as an argument to unzip(), we will get a pop-up window where we can select our file using familiar point and click. Ideally, you would rather use the path to your file. But sometimes these shortcuts are convenient.
unzip(file.choose())
If you look at the Files window in the bottom right corner of RStudio you should see now a new subdirectory that contains a .csv file with the data that we need. Since I downloaded the data from November 2017 in my case this subdirectory is called 2017-11.
Before we can use this data we need to read it or import it into R and turn it into a dataframe object. To read in the .csv file, which is the format we just downloaded, the command is read.csv().
Again there are two ways to read in the data. If you want to open a window where you can manually navigate and open the file, you can pass file.choose() as the argument to the read.csv() function, as illustrated earlier.
#This code creates a dataframe object called crimes which will include the spreadsheet in the file we have downloaded. In my case, that is 2017-11-greater-manchester-street.csv.
crimes <- read.csv(file.choose())
Or, if you know the path to your file, you can hardcode it in there, within quotation marks:
crimes <- read.csv("2017-11/2017-11-greater-manchester-street.csv")
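If the hardcoded path does not work, the most common reason is that the file is not where R is looking. As a small defensive sketch (assuming the same folder and file names as above), you can build the path from its parts and check that the file exists before reading it:

```r
# Build the path from its parts; file.path() inserts the separators for us.
csv_path <- file.path("2017-11", "2017-11-greater-manchester-street.csv")

# Only try to read the file if it is where we expect it to be.
if (file.exists(csv_path)) {
  crimes <- read.csv(csv_path)
} else {
  message("File not found: ", csv_path,
          " (check your working directory with getwd())")
}
```

The path is relative, so it is resolved from your working directory, which is one more reason to work inside an RStudio project.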
You might notice that crimes has appeared in your work environment window. It will tell you how many observations (rows - and incidentally the number of recorded crimes in November 2017 within the GMP jurisdiction) and how many variables (columns) your data has.
Let’s have a look at the crimes dataframe:
#This will open the data browser in RStudio
View(crimes)
If you would rather just see your results in the console, you can use the glimpse() function from the tibble package. This function does just that: it gives you a quick glimpse of the first few cases in the dataframe. Notice that there are two columns (Longitude and Latitude) that provide the geographical coordinates that we need to plot this data.
library(tibble)
glimpse(crimes)
## Observations: 34,052
## Variables: 12
## $ Crime.ID <fctr> , f892dce3e7a4c45fe4f8f09f24d6a494f2b49...
## $ Month <fctr> 2017-11, 2017-11, 2017-11, 2017-11, 201...
## $ Reported.by <fctr> Greater Manchester Police, Greater Manc...
## $ Falls.within <fctr> Greater Manchester Police, Greater Manc...
## $ Longitude <dbl> -2.462774, -2.462774, -2.462774, -2.4644...
## $ Latitude <dbl> 53.62210, 53.62210, 53.62210, 53.61250, ...
## $ Location <fctr> On or near Scout Road, On or near Scout...
## $ LSOA.code <fctr> E01012628, E01012628, E01012628, E01004...
## $ LSOA.name <fctr> Blackburn with Darwen 018D, Blackburn w...
## $ Crime.type <fctr> Anti-social behaviour, Criminal damage ...
## $ Last.outcome.category <fctr> , Investigation complete; no suspect id...
## $ Context <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
In GIS settings you may have multiple vector data that you want to represent in the same map simultaneously. So creating maps is often a function of adding new layers to an existing map. Before we saw how to generate a basemap using ggmap. Now we are going to add a new layer to that basemap with information about the location of crimes (as indexed in our crime dataframe).
To do that we will use the ggmap() function but we will add a new geometric object (points) to the map using geom_point. Within the geom_point command we need to identify where the information is coming from, so we need to pass an argument identifying the dataframe object, and we need to identify how the aesthetics (in short, aes) are going to be defined (in this case, we need to identify the variables that define the location of the points in a two dimensional plane).
ggmap(map1_stamen) +
geom_point(aes(Longitude, Latitude), data = crimes)
## Warning: Removed 33530 rows containing missing values (geom_point).
Hmmm… The crimes appear as black points, which blend into the black and white map. Let’s change the colour to something else. We can do that by setting the colour argument of geom_point(). Because we want one fixed colour for every point, rather than a colour that depends on a variable, the argument goes outside aes().
ggmap(map1_stamen) +
geom_point(aes(Longitude, Latitude), colour = "red", data = crimes)
## Warning: Removed 33530 rows containing missing values (geom_point).
Keep in mind these are not exact locations. For privacy reasons, the coordinates published on police.uk are deliberately approximate, so that it is harder to identify individuals based on this data. But this gives you an approximate idea of where crime takes place in and around the university area.
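You may be wondering about the warnings saying that tens of thousands of rows were removed. The most likely explanation is that the data cover the whole of Greater Manchester while our basemap only covers a small area around the University, so points that fall outside the plotted extent (or that have missing coordinates) are dropped. The base R sketch below mimics that logic with invented map limits and three toy points; the limits are an assumption for illustration, not the real extent of map1_stamen.

```r
# Made-up map limits roughly around the University (assumed for illustration).
bbox <- c(xmin = -2.26, xmax = -2.21, ymin = 53.45, ymax = 53.48)

# Toy points: one near the University, one far away, one with missing coordinates.
toy <- data.frame(
  Longitude = c(-2.2339, -2.4628, NA),
  Latitude  = c(53.4669, 53.6221, NA)
)

# TRUE only for points with valid coordinates that fall inside the limits.
inside <- !is.na(toy$Longitude) & !is.na(toy$Latitude) &
  toy$Longitude >= bbox[["xmin"]] & toy$Longitude <= bbox[["xmax"]] &
  toy$Latitude  >= bbox[["ymin"]] & toy$Latitude  <= bbox[["ymax"]]

toy_on_map <- toy[inside, ]
nrow(toy_on_map)   # points that would be drawn
sum(!inside)       # points that would be dropped
```

Filtering the data yourself in this way before plotting is optional, but it makes explicit which cases end up on the map.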
Let’s get even more detailed information. When you glimpsed at the data, you may have noticed that one of the attributes in the dataframe was type of crime, as indexed by the Crime.type variable. We can look at the frequency distribution of this variable using the table() function.
table(crimes$Crime.type)
##
## Anti-social behaviour Bicycle theft
## 5417 356
## Burglary Criminal damage and arson
## 2774 3500
## Drugs Other crime
## 591 612
## Other theft Possession of weapons
## 2087 236
## Public order Robbery
## 4223 647
## Shoplifting Theft from the person
## 1427 751
## Vehicle crime Violence and sexual offences
## 2786 8645
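A frequency table like this is often easier to read when it is sorted, or expressed as percentages. Here is a minimal sketch using a small invented vector of crime types, so that it runs on its own; with the real data you would use crimes$Crime.type instead of toy_types.

```r
# A toy vector of crime types (invented, much smaller than the real data).
toy_types <- c("Burglary", "Robbery", "Burglary", "Drugs", "Burglary", "Robbery")

counts <- table(toy_types)

# Most frequent categories first.
sort(counts, decreasing = TRUE)

# Share of each category, as a percentage rounded to one decimal place.
round(100 * prop.table(counts), 1)
```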
Keep in mind that this is data for the whole of Greater Manchester. Let’s see how we can use this in our map. It’s very simple: we can add a new aesthetic to the map. To do so we will pass a new argument to geom_point to ask R to use a different colour for each type of crime, based on the information in Crime.type.
ggmap(map1_stamen) +
geom_point(aes(Longitude, Latitude, colour=Crime.type), data = crimes)
## Warning: Removed 33530 rows containing missing values (geom_point).
We are going to leave it here for now. But if you want to understand ggmap better you may want to read this article by its developers. If you are not familiar with ggplot2 you may also want to have a look at this tutorial that we wrote.
HOMEWORK 1:
Think about this visualisation. 1. How would you characterise the basemap: as a vector map or as a raster image? 2. How would you characterise the layer representing the crimes: as a vector map or as a raster image? 3. Is the resulting visualisation clear? Is it helpful? If you could change anything, what would it be? Write your thoughts up.
This is going to be your first assignment to be submitted as part of next week. Export the image file into a Word document and write your answers there.
You can get a long way with spatial data stored in data frames, but it makes life easier if they are stored in special spatial objects. In the previous exercise we saw how we can easily display point patterns in ggmap just using data extracted from a dataframe with longitude and latitude columns. In many instances, when you work with GIS you rely on spatial objects that are a bit more complex in structure and that are stored in a wide variety of proprietary and open source formats.
In this section you are going to learn how to take one of the most popular data formats for spatial objects, the shapefile, and read it into R. The shapefile was developed by ESRI, the developers and vendors of ArcGIS. And although many other formats have been developed since and ESRI no longer holds the same market position it once occupied (though they’re still the player to beat), shapefiles continue to be one of the most popular formats you will encounter in your work. You can read more about shapefiles here.
We are going to learn here how to obtain shapefiles for British census geographies. In the class today we talked about the idea of neighbourhoods and we explained how a good deal of sociological and criminological work has traditionally used census geographies as proxies for neighbourhoods. As of today, they are still the geographical subdivisions for which we can obtain the greatest amount of attribute information (e.g., sociodemographics, etc.).
You can read more about census boundary data here. “Boundary data are a digitised representation of the underlying geography of the census”. Census Geography is often used in research and spatial analysis because it is divided into units based on population counts, created to form comparable units, rather than other administrative boundaries such as wards or police force areas. However depending on your research question and the context for your analysis, you might be using different units. The hierarchy of the census geographies goes from Country to Local Authority to Middle Layer Super Output Area (MSOA) to Lower Layer Super Output Area (LSOA) to Output Area:
Here we will get some boundaries for Manchester. Let’s use the LSOA level. These are geographical regions designed to be more stable over time and consistent in size than existing administrative and political boundaries. LSOAs comprise, on average, 600 households that are combined on the basis of spatial proximity and homogeneity of dwelling type and tenure.
So to get some boundary data, you can use the UK Data Service website. There is a simple Boundary Data Selector.
When you get to the link, you will see on the top there is some notification to help you with the boundary data selector. If you are feeling unsure at any point, feel free to click on that help to guide you.
For now, let’s focus on the selector options. Here you can choose the country you want to select shapefiles for. We select “England”. You can also choose the type of geography we want to use. Here we select “Statistical Building Block”, as discussed above. And finally you can select when you want it for. If you are working with historical data, it makes sense to find boundaries that match the timescale for your data. Here we will be dealing with contemporary data, and therefore we want to be able to use the newest available boundary data.
Once you have selected these options, click on the “Find” button. That will populate the box below:
Here you can select the boundaries we want. As discussed, we want the census lower super output areas. But again, your future choices here will depend on what data you want to be mapping.
Once you’ve made your choice, click on “List Areas”. This will now populate the box below. We are here concerned with Manchester. However, you can select more than one if you want boundaries for more than one area as well. Just hold down “ctrl” to select multiple areas individually, or the shift key to select everything in between.
Once you’ve made your decision click on the “Extract Boundary Data” button. You will see the following message:
You can bookmark, or just stay on the page and wait. How long you have to wait will depend on how much data you have requested to download.
When your data is ready, you will see the following message:
You have to right click on the “BoundaryData.zip”, and hit Save Target as on a PC or Save Link As on a Mac:
Navigate to the folder you have created for this analysis, and save the .zip file there. Extract the file contents using whatever you like to use to unzip compressed files.
#For example,
unzip("BoundaryData.zip", exdir="BoundaryData")
You should end up with a folder called “BoundaryData”. Have a look at its contents:
So you can see immediately that there is some documentation around the usage of this shapefile, in the readme and the terms and conditions. Have a look at these as they will contain information about how you can use this map. For example, all your maps will have to mention where you got all the data from. So since you got this boundary data from the UKDS, you will have to note the following:
“Contains National Statistics data © Crown copyright and database right [year] Contains OS data © Crown copyright [and database right] (year)”
You can read more about this in the terms and conditions document.
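But then you will also notice that there are 4 files with the same name, “england_lsoa_2011”. It is important that you keep all these files in the same location as each other! They all contain different bits of information about your shapefile (and they are all needed):
.shp: the feature geometry itself
.shx: a positional index of the feature geometry
.dbf: the attribute table, with one row per feature
.prj: the coordinate system and projection information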
Sometimes there might be more files associated with your shapefile as well, but we will not cover them here. So unlike when you work with spreadsheets and data in tabular form, which typically is just all included in one file; when you work with spatial data, you have to live with the required information living in separate files that need to be stored together. So, being tidy and organised is even more important when you carry out projects that involve spatial data. Please do remember the suggestions we provided last week as to how to organise your RStudio project directories.
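Because a missing companion file is a common cause of trouble, it can be worth checking that everything is in place before reading a shapefile. The sketch below uses a hypothetical helper, shapefile_parts(), written for this illustration (it is not part of any package), and assumes the four usual extensions .shp, .shx, .dbf and .prj. It is demonstrated on empty placeholder files in a temporary folder so that it runs on its own.

```r
# Hypothetical helper: report which of the usual shapefile components
# sit next to a given .shp file.
shapefile_parts <- function(shp_path) {
  base <- tools::file_path_sans_ext(shp_path)
  parts <- paste0(base, c(".shp", ".shx", ".dbf", ".prj"))
  setNames(file.exists(parts), basename(parts))
}

# Demonstration in a temporary folder with empty placeholder files
# (the .prj file is deliberately left out).
demo_dir <- tempfile("shp_demo")
dir.create(demo_dir)
invisible(file.create(file.path(demo_dir, c("toy.shp", "toy.shx", "toy.dbf"))))

shapefile_parts(file.path(demo_dir, "toy.shp"))
```

With your own data you would pass the path to your .shp file instead, and any FALSE in the result tells you which file has gone astray.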
Traditionally, spatial analysis in R was done using the sp package, which creates a particular way of storing spatial objects in R. When most packages for spatial data analysis and thematic cartography in R were first developed, sp was the only way to work with spatial data in R. More than 450 packages rely on sp, making it an important part of the R ecosystem. More recently a new package, sf (which stands for “simple features”), is revolutionising the way that R does spatial analysis. This new package provides a new way of storing spatial objects in R and most recent R packages for spatial analysis and cartography are using it as the new default. It is easy to transform sf objects into sp objects, so that those packages that still don’t use this new format can be used. But in this course we will emphasise the use of sf whenever possible. You can read more about the history of spatial packages and the sf package in the first two chapters of this book.
HOMEWORK 2
Read Section 2.1 of the Geocomputation book linked above. Answer the following questions: 1. Why is sf better? 2. What code do you need to transform an sf object into an sp object? 3. What, simply, is an sf object?
Install sf if you don’t already have it. Then load it.
library(sf)
## Linking to GEOS 3.6.1, GDAL 2.2.0, proj.4 4.9.3
On Mac and Linux a few requirements must be met to install sf. These are described in the package’s README at github.com/r-spatial/sf.
To read in your data, you will need to know the path to where you have saved it. Ideally this will be in your working directory.
Let’s create an object and assign it our shapefile’s name:
#Remember to use the appropriate file path in your case
shp_name <- "BoundaryData/england_lsoa_2011.shp"
Make sure that this is saved in your working directory, and you have set your working directory.
Now use the st_read() function to read in the shapefile:
manchester_lsoa <- st_read(shp_name)
## Reading layer `england_lsoa_2011' from data source `C:\Users\Juanjo Medina\Dropbox\1_Teaching\1 Manchester courses\31152_60142 GIS and Crime Mapping\2018_labs\BoundaryData\england_lsoa_2011.shp' using driver `ESRI Shapefile'
## Simple feature collection with 282 features and 3 fields
## geometry type: POLYGON
## dimension: XY
## bbox: xmin: 378833.2 ymin: 382620.6 xmax: 390350.2 ymax: 405357.1
## epsg (SRID): NA
## proj4string: +proj=tmerc +lat_0=49 +lon_0=-2 +k=0.9996012717 +x_0=400000 +y_0=-100000 +datum=OSGB36 +units=m +no_defs
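The output above reports no EPSG code, but the proj4string describes a transverse Mercator projection on the OSGB36 datum with the parameters of British National Grid. As a sketch of how sf handles coordinate reference systems, the code below builds a single point in WGS 84 and reprojects it to British National Grid (EPSG code 27700). It uses a hand-made point rather than the downloaded shapefile so that it runs on its own.

```r
library(sf)

# The University of Manchester as a point in WGS 84 (EPSG:4326):
# coordinates are given as longitude, then latitude.
uom_wgs84 <- st_sfc(st_point(c(-2.233885, 53.466852)), crs = 4326)

# Reproject to British National Grid (EPSG:27700):
# coordinates become eastings and northings in metres.
uom_bng <- st_transform(uom_wgs84, 27700)

st_crs(uom_bng)          # describes the coordinate reference system
st_coordinates(uom_bng)  # a matrix with columns X (easting) and Y (northing)
```

The resulting easting and northing should fall inside the bounding box reported for the Manchester boundaries above, which is a quick sanity check that both datasets refer to the same part of the world.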
Now you have your spatial data file. You can have a look at what sort of data it contains, the same way you would view a dataframe, with the View() function:
View(manchester_lsoa)
## Observations: 282
## Variables: 4
## $ label <fctr> E08000003E02001062E01005066, E08000003E02001092E0100...
## $ name <fctr> Manchester 018E, Manchester 048C, Manchester 018A, M...
## $ code <fctr> E01005066, E01005073, E01005061, E01005062, E0100506...
## $ geometry <simple_feature> POLYGON ((384850 397432, 38..., POLYGON ((...
And of course, since it’s spatial data, you can finally map it:
plot(manchester_lsoa)
This is the main way that we will be creating maps. OK so you see that three maps appeared - any ideas why? Do you know what the three maps correspond to? Discuss.
Now let’s get some crime data to add to this map. We can do this by using the police.uk data we obtained earlier. Have a look again at the information stored in the crimes spreadsheet:
glimpse(crimes)
## Observations: 34,052
## Variables: 12
## $ Crime.ID <fctr> , f892dce3e7a4c45fe4f8f09f24d6a494f2b49...
## $ Month <fctr> 2017-11, 2017-11, 2017-11, 2017-11, 201...
## $ Reported.by <fctr> Greater Manchester Police, Greater Manc...
## $ Falls.within <fctr> Greater Manchester Police, Greater Manc...
## $ Longitude <dbl> -2.462774, -2.462774, -2.462774, -2.4644...
## $ Latitude <dbl> 53.62210, 53.62210, 53.62210, 53.61250, ...
## $ Location <fctr> On or near Scout Road, On or near Scout...
## $ LSOA.code <fctr> E01012628, E01012628, E01012628, E01004...
## $ LSOA.name <fctr> Blackburn with Darwen 018D, Blackburn w...
## $ Crime.type <fctr> Anti-social behaviour, Criminal damage ...
## $ Last.outcome.category <fctr> , Investigation complete; no suspect id...
## $ Context <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
You should be able to see that there is a variable, a column in this spreadsheet, called LSOA.code. Yep, that is the unique identifier that is telling us in which lower super output area each crime took place. If only we could use this information to create a new dataset counting the number of criminal events that took place within each of these areas!!!
Ok, here is where you are introduced to the wonderful world of dplyr. This is a package for conducting all sorts of operations with data frames. We are not going to cover the full functionality of dplyr (which you can consult in this tutorial), but we are going to cover three different very useful elements of dplyr: the select function, the group_by function, and the piping operator.
Load the library:
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
The select() function provides you with a simple way of subsetting columns from a data frame. So, say we just want to use one variable, LSOA.code, from the crimes dataframe and store it in a new object, we could write the following code:
new_object <- select(crimes, LSOA.code)
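To see what select() does on something small enough to check by eye, here is a minimal sketch on a made-up data frame. The object toy_crimes and all the values in it are invented for illustration; they are not the police.uk data.

```r
library(dplyr)

# Hypothetical toy data standing in for the crimes data frame
toy_crimes <- data.frame(
  LSOA.code  = c("E01000001", "E01000001", "E01000002"),
  Crime.type = c("Burglary", "Robbery", "Burglary"),
  Month      = c("2017-11", "2017-11", "2017-11"),
  stringsAsFactors = FALSE
)

# Keep a single column
only_codes <- select(toy_crimes, LSOA.code)

# Keep several columns by listing them
codes_and_types <- select(toy_crimes, LSOA.code, Crime.type)

# Drop a column by prefixing its name with a minus sign
without_month <- select(toy_crimes, -Month)

names(only_codes)
names(codes_and_types)
names(without_month)
```

Note that select() only changes which columns are kept; the number of rows stays the same.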
We can also use the group_by() function for performing group operations. Essentially this function asks R to group cases within categories and then do something with those grouped cases. So, say we want to count the number of cases within each LSOA, we could use the following code:
#First we group the cases by LSOA code and store this organised data in a new object
grouped_crimes <- group_by(new_object, LSOA.code)
#Then we could count the number of cases within each category and use the summarise function to print the results
summarise(grouped_crimes, count=n())
#We could in fact create a new dataframe with these results
crime_per_LSOA <- summarise(grouped_crimes, count=n())
As you can see we can do what we wanted, create a new dataframe with the required info, but there is a more efficient way of doing this, without so many intermediate steps clogging up our environment with unnecessary objects. That’s where the piping operator comes in handy. The piping operator is written like %>% and it can be read as “and then”. Look at the code below:
#We create a new object called crimes_per_lsoa: take crimes, and then select only the LSOA.code column, and then group by LSOA.code, and then count the number of cases within each group. The result is what gets stored in the new object.
crimes_per_lsoa <- crimes %>%
  select(LSOA.code) %>%
  group_by(LSOA.code) %>%
  summarise(count=n())
Essentially we obtain the same results but with more streamlined and elegant code, and not needing additional objects in our environment.
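If you want to convince yourself that the step-by-step and the piped versions really do agree, the sketch below runs both on a tiny invented data frame (toy_crimes and its LSOA codes are made up). It also shows count(), a dplyr shortcut for the group_by() plus summarise(n()) pattern; the name argument assumes a reasonably recent version of dplyr.

```r
library(dplyr)

# Hypothetical toy data: three crimes in one LSOA, one in another
toy_crimes <- data.frame(
  LSOA.code = c("E01000001", "E01000001", "E01000002", "E01000001"),
  stringsAsFactors = FALSE
)

# Step-by-step version, with an intermediate object
toy_grouped <- group_by(toy_crimes, LSOA.code)
stepwise    <- summarise(toy_grouped, count = n())

# Piped version, no intermediate object
piped <- toy_crimes %>%
  group_by(LSOA.code) %>%
  summarise(count = n())

# count() does the grouping and the counting in one call
shortcut <- toy_crimes %>%
  count(LSOA.code, name = "count")

piped
```

All three objects hold one row per LSOA code with the number of cases in the count column.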
Notice anything similar between the data from the shapefile and the frequency table data we just created? Do they share a column?
Yes! You might notice that the LSOA.code field in the crimes data matches the values in the code field in the spatial data. In theory we could join these two data tables.
So how do we do this? Well what you can do is to link one data set with another. Data linking is used to bring together information from different sources in order to create a new, richer dataset. This involves identifying and combining information from corresponding records on each of the different source datasets. The records in the resulting linked dataset contain some data from each of the source datasets. Most linking techniques combine records from different datasets if they refer to the same entity. (An entity may be a person, organisation, household or even a geographic region.)
You can merge (combine) rows from one table into another just by pasting them in the first empty cells below the target table—the table grows in size to include the new rows. And if the rows in both tables match up, you can merge columns from one table with another by pasting them in the first empty cells to the right of the table—again, the table grows, this time to include the new columns.
Merging rows is pretty straightforward, but merging columns can be tricky if the rows of one table don’t always line up with the rows in the other table. By using left_join() from the dplyr package, you can avoid some of the alignment problems.
left_join() will return all rows from x, and all columns from x and y. Rows in x with no match in y will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned.
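The behaviour described above is easier to see on a small case. This is a minimal sketch with two invented tables, x and y, chosen so that one key has two matches and one key has none.

```r
library(dplyr)

# Hypothetical tables: x is the table we want to keep whole
x <- data.frame(code = c("A", "B", "C"), stringsAsFactors = FALSE)
y <- data.frame(
  code  = c("A", "A", "C"),
  value = c(10, 20, 30),
  stringsAsFactors = FALSE
)

joined <- left_join(x, y, by = "code")
joined
# "A" appears twice (two matches in y), "B" is kept with value NA
# (no match in y), and "C" appears once.
```

This is why it matters that crimes_per_lsoa has exactly one row per LSOA code: it keeps the join from duplicating any of the areas.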
So we’ve already identified that both our crimes data and the spatial data contain a column with matching values, the codes for the LSOA that each row represents.
You need a unique identifier to be present for each row in all the data sets that you wish to join. This is how R knows what values belong to what row! What you are doing is matching each value from one table to the next, using this unique identifier column, which exists in both tables. For example, let’s say we have two data sets from some people in Hawkins, Indiana. In one data set we collected information about their age. In another one, we collected information about their hair colour. If we collected some information that is unique to each observation, and this is the same in both sets of data, for example their names, then we can link them up, based on this information. Something like this:
And by doing so, we produce a final table that contains all values, lined up correctly for each individual observation, like this:
This is all we are doing when merging tables: we are making sure that we line up the correct values for all the variables, for all our observations.
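The Hawkins example can be sketched in a few lines of R. The names, ages and hair colours below are made up purely for illustration, and the two tables are deliberately listed in a different row order to show that matching happens on the name column and not on row position.

```r
library(dplyr)

# Made-up values, for illustration only
ages <- data.frame(
  name = c("Joyce", "Jim", "Nancy"),
  age  = c(42, 45, 17),
  stringsAsFactors = FALSE
)
hair <- data.frame(
  name        = c("Nancy", "Joyce", "Jim"),  # note the different row order
  hair_colour = c("brown", "brown", "blond"),
  stringsAsFactors = FALSE
)

# Rows are matched on name, not on position
people <- left_join(ages, hair, by = "name")
people
```

Each person ends up with their own age and hair colour on a single row, even though the source tables were ordered differently.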
Well actually there is a whole family of join functions as part of dplyr. But here we use left join, because that way we keep all the rows in x (the left-hand side dataframe), and join to it all the matched columns in y (the right-hand side dataframe).
So let’s join the crimes data to the spatial data, using left_join():
We have to tell left_join which dataframes we want to join, as well as the names of the columns that contain the matching values in each one. This is “code” in the manchester_lsoa dataframe and “LSOA.code” in the crimes_per_lsoa dataframe. Like so:
manchester_lsoa <- left_join(manchester_lsoa, crimes_per_lsoa, by = c("code"="LSOA.code"))
## Warning: Column `code`/`LSOA.code` joining factors with different levels,
## coercing to character vector
Now if you have a look at the data again, you will see that the column with the number of crimes (count) has been added on.
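Two details of this join are worth a closer look: the warning above about factors, and what happens to areas that have no row in the crimes table. The sketch below uses invented stand-ins (toy_lsoa and toy_counts, attributes only, no geometry) and shows one possible way of handling both. Converting the key columns to character first avoids the factor-level warning. Replacing NA with 0 is an assumption, namely that an area missing from the crimes file had no recorded crimes, so check that it holds for your data before doing this.

```r
library(dplyr)

# Hypothetical stand-ins for manchester_lsoa (attributes only) and crimes_per_lsoa
toy_lsoa <- data.frame(
  code = factor(c("E01000001", "E01000002", "E01000003"))
)
toy_counts <- data.frame(
  LSOA.code = factor(c("E01000001", "E01000003")),
  count     = c(3L, 1L)
)

toy_joined <- toy_lsoa %>%
  # join on character keys rather than factors with different levels
  mutate(code = as.character(code)) %>%
  left_join(
    mutate(toy_counts, LSOA.code = as.character(LSOA.code)),
    by = c("code" = "LSOA.code")
  ) %>%
  # assumption: areas with no row in toy_counts had zero recorded crimes
  mutate(count = coalesce(count, 0L))

toy_joined
```

Without the last step, the area with no match would carry an NA in the count column and would show up as missing on the map.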
You can now use this to create a thematic choropleth map:
plot(manchester_lsoa["count"])
Very quickly, just to illustrate that things can be prettier, we are going to use this data with another package, tmap, short for thematic maps. This package also borrows from the ggplot syntax and is specifically designed to make the creation of thematic maps more convenient. It takes care of a lot of the styling and aesthetics, which reduces the amount of code we need significantly. So, look at what we can do with our previous map:
library(tmap)
tm_shape(manchester_lsoa) +
  tm_polygons("count", style="quantile", title="Count of crimes in Manchester")
And we can even add some interactivity!
tmap_mode("view")
## tmap mode set to interactive viewing
last_map()
So, this is all for today. Next week we will come back to tmap and explain the different arguments that we use when creating thematic maps. This was just an introduction to some of the things we can do. Next week we will spend a bit more time discussing how to make good choices when producing maps.